Term Deposit Sale

Attribute information

Input variables:
- Bank client data
- Related to previous contact
- Other attributes

Output variable (desired target)
import warnings
warnings.filterwarnings('ignore')
#--------------------------------------------------
import pandas as pd
import numpy as np
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#--------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# from IPython.display import Image
from sklearn import tree
from os import system
#--------------------------------------------------
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score,accuracy_score, roc_curve, classification_report
# Hyperparameter optimization using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
# import Dataset
bd = pd.read_csv('bank-full.csv')
#checking how many columns and how many total entries(rows)
bd.shape
#checking columns
bd.columns
#checking the first 10 data entries
bd.head(10)
#checking the last 10 data entries
bd.tail(10)
At first inspection, some columns contain seemingly invalid entries. "unknown" appears in the columns "job", "education", "contact" and "poutcome", and is considered an invalid entry; the "poutcome" column also contains the value "other". In addition, there are negative values in the pdays column. The next cell shows which other columns contain erroneous entries.
# Let us count the categorical variables and view the entries per column.
# By doing so, we see more erroneous entries and their total counts.
categorical_cols = ['job','marital', 'education', 'default','housing', 'loan','contact','poutcome','Target','previous']
for i in categorical_cols:
    x = bd[i].value_counts()
    print(i)
    print(x)
    print("")
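The "unknown" pollution that the loop reveals can also be counted directly with a vectorized comparison. A minimal sketch on a hypothetical mini-frame (the values below are illustrative, not taken from bank-full.csv):

```python
import pandas as pd

# Hypothetical stand-in for the bank data; values are illustrative only
df = pd.DataFrame({
    'job':       ['admin.', 'unknown', 'technician', 'unknown'],
    'education': ['primary', 'secondary', 'unknown', 'tertiary'],
})

# Element-wise comparison, then a per-column sum of the True values
unknown_counts = (df == 'unknown').sum()
print(unknown_counts)
```

The same one-liner applied to the full dataframe gives the "unknown" count per column in one pass instead of reading it off each `value_counts()` listing.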
Let us further investigate the dataset in the next cells...
#Let us check datatypes/object
bd.info()
#checking the dataset's descriptive statistic summary
bd.describe()
#transpose view of the dataset's descriptive statistic summary
bd.describe().transpose()
#checking for unique entries
bd.nunique()
#checking for empty rows/incorrect imputations
bd.isna().values.any()
#null values
bd.isnull().sum()
#rechecking how many total entries(rows) per columns
bd.shape
Based on our query:
a. Univariate analysis – data types and description of the independent attributes which should include (name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions / tails, missing values, outliers.
b. Strategies to address the different data challenges such as data pollution, outlier’s treatment and missing values treatment.
c. Please provide comments in jupyter notebook regarding the steps you take and insights drawn from the plots.
UNIVARIATE ANALYSIS
Pandas Profiling Report
from pandas_profiling import ProfileReport
ProfileReport(bd)
Let us first deal with the erroneous data entries: convert the yes/no columns into 1/0, and clean the negatively signed entries (in this dataset, pdays = -1 is a sentinel meaning the client was not contacted in a previous campaign).
# converting "yes" to 1 and "no" to 0
categorical_cols1 = ['default','housing','loan','Target']
for cat in categorical_cols1:
    bd[cat] = bd[cat].apply(lambda x: 1 if x=='yes' else 0)
# cleaning the negative values observed in the pdays attribute
bd['pdays'] = bd['pdays'].abs()
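An equivalent, arguably clearer, way to do the same yes/no conversion is `Series.map` with an explicit dictionary; the sketch below also shows how `abs()` folds the -1 sentinel in pdays to +1 (toy values, not the real data):

```python
import pandas as pd

# Series.map with an explicit dictionary is equivalent to the lambda above
s = pd.Series(['yes', 'no', 'yes'])
mapped = s.map({'yes': 1, 'no': 0})

# pdays uses -1 as a "never contacted" sentinel; abs() folds it to +1
pdays = pd.Series([-1, 5, -1, 30]).abs()
```

One difference worth knowing: `map` leaves NaN for values missing from the dictionary, whereas the `else 0` branch of the lambda silently maps any unexpected value to 0.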
Let us now do some Exploratory Data Analysis
sns.pairplot(bd)
In the following cells, we evaluate the relationship of each variable to the Target variable. The Target variable tells us whether the client has subscribed to a term deposit (1 = yes, 0 = no).
# Let us check the Age distribution
ageave= round(bd.age.mean(),2)
sns.distplot(bd.age, kde=False, bins=20, color='b');
print('Customers Average age :',ageave)
# Let us check age variable in relation to Target variable
plt.figure(figsize=(15,6))
ax=sns.countplot(x='age', data=bd, hue='Target')
plt.setp(ax.get_xticklabels(), rotation=90);
# Let us check jobs variable in relation to Target variable
plt.figure(figsize=(15,6))
ax=sns.countplot(x='job', data=bd, hue='Target')
plt.setp(ax.get_xticklabels(), rotation=45);
# Let us check the marital variable in relation to the Target variable
sns.countplot(x='marital', data=bd, hue='Target');
# Let us check the education variable in relation to the Target variable
plt.figure(figsize=(15,6))
ax=sns.countplot(x='education', data=bd, hue='Target')
plt.setp(ax.get_xticklabels(), rotation=45);
# Let us check the default variable in relation to the Target variable
sns.countplot(x='default', data=bd, hue='Target');
sy = len(bd[bd.default==1])
sn = len(bd[bd.default==0])
print('Percent of customers with credit default :', round(sy/(sy+sn)*100,2),"%")
print('Percent of customers with no credit default :', round(sn/(sy+sn)*100,2),"%")
cd = bd.Target[(bd.default==1)&(bd.Target==1)].count()
print('Number of Customers with credit default that have term deposit :', cd)
cd2 = bd.Target[(bd.default==0)&(bd.Target==1)].count()
print('Number of Customers with no credit default that have term deposit :', cd2)
# Let us check the housing variable in relation to Target variable
sns.countplot(x='housing', data=bd, hue='Target');
#Let us check loan variable in relation to Target variable
sns.countplot(x='loan', data=bd, hue='Target');
#Let us check balance variable in relation to Target variable
sns.distplot(bd.balance, kde=False, bins=25, color='g');
ban_median = bd.balance.median()
ban_mean = bd.balance.mean()
print('Average customer balance :', round(ban_mean,2))
print('Right skewed distribution (mean - median is positive) :',round(ban_mean-ban_median,2))
negba = bd.balance[bd.balance<0].count()
print('Customers with a negative account balance :',negba, '/', round(negba/bd.shape[0]*100,2),"%")
zerba = bd.balance[bd.balance==0].count()
print('Customers with a zero account balance :', zerba, '/', round(zerba/bd.shape[0]*100,2),"%")
posba = bd.balance[(bd.balance>0) & (bd.balance<=5000)].count()
print('Customers with a balance in (0, 5,000] :', posba, '/', round(posba/bd.shape[0]*100,2),"%")
higba = bd.balance[bd.balance>5000].count()
print('Customers with a balance higher than 5,000 :', higba, '/', round(higba/bd.shape[0]*100,2),"%")
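The four balance brackets counted above can be produced in a single pass with `pd.cut`. A sketch on synthetic balances, with bin edges chosen to reproduce the brackets above (right-closed bins, so 0 lands in its own "zero" bucket):

```python
import pandas as pd

balance = pd.Series([-200, 0, 1500, 7000, 3000])   # synthetic values
brackets = pd.cut(
    balance,
    bins=[float('-inf'), -0.5, 0, 5000, float('inf')],  # right-closed bins
    labels=['negative', 'zero', '(0, 5000]', '> 5000'],
)
counts = brackets.value_counts()
print(counts)
```

On the real data, `bd.balance` can be passed through the same `pd.cut` call and the four percentages fall out of a single `value_counts(normalize=True)`.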
#Let us check contact variable in relation to Target variable
sns.countplot(x='contact', data=bd, hue='Target');
#Let us check month variable in relation to Target variable
pd.crosstab(bd.month,bd.Target).plot(kind='bar', figsize=(10,6));
plt.title('Frequency of customer contacted by month during the year');
plt.xlabel('Month of the year');
plt.ylabel('Frequency');
#Let us check day variable in relation to Target variable
pd.crosstab(bd.day,bd.Target).plot(kind='bar',figsize=(10,6));
plt.title('Day in the month contact to customer');
plt.xlabel('Day in the month');
plt.ylabel('Frequency of contact');
#Let us check duration variable in relation to Target variable
sns.distplot(bd.duration, kde=False, bins=100, color='g');
ban_median = bd.duration.median()
ban_mean = round(bd.duration.mean(),2)
print('Average contact duration time in seconds :',ban_mean)
print('Right skewed distribution (mean - median is positive) :',round(ban_mean-ban_median,2))
#Let us check campaign variable in relation to Target Variable
print('Customers were contacted on average :',round(bd.campaign.mean(),2),'times')
bd.campaign.hist(bins=20, color='orange');
plt.xlabel('# of times Customers were contacted during this campaign');
#Let us check pdays variable in relation to Target variable
# after abs(), pdays == 1 corresponds to the original -1 sentinel
# (customers not contacted in a previous campaign)
cnc = bd.pdays[bd.pdays==1].count()
cc = bd.shape[0]-cnc
bd.pdays.sort_values(ignore_index=True).plot();
plt.xlabel('# Customers contacted from previous campaign');
plt.ylabel('# days passed after last contact');
print('Total number of customers contacted in a previous campaign :',cc, 'or', round(cc/bd.shape[0]*100,2),'%')
print('Total number of customers not contacted in a previous campaign :',cnc, 'or', round(cnc/bd.shape[0]*100,2),'%')
#previous: number of contacts performed before this campaign and for this client
pc = bd.previous[bd.previous>0].mean()
print('Average number of contacts per customer before this campaign :',round(pc,2))
bd.previous[bd.previous>0].hist(bins=30);
plt.xlabel('# of times Customers were contacted before this campaign');
#Let us check poutcome variable in relation to Target variable
pd.crosstab(bd.poutcome, bd.Target).plot(kind='bar')
plt.xlabel('outcome of the previous marketing campaign');
plt.ylabel('Frequency');
#Evaluating the Target variable
sy = len(bd.Target[bd.Target==1])
sn = len(bd.Target[bd.Target==0])
print('Percent of customers subscribed to a term deposit :', round(sy/(sy+sn)*100,2),'%')
print('Percent of customers not subscribed to a term deposit :', round(sn/(sy+sn)*100,2),'%')
sns.countplot(x='Target', data=bd);
ANALYSIS
a. Bi-variate analysis between the predictor variables and target column. Comment on your findings in terms of their relationship and degree of relation if any. Visualize the analysis using boxplots and pair plots, histograms or density curves. Select the most appropriate attributes.
b. Please provide comments in jupyter notebook regarding the steps you take and insights drawn from the plots
# Target was already mapped to 1/0 above, so a further string comparison would mislabel it
bd["Target_Int"] = bd["Target"]
bd.head(10)
#Let us check the mean value of all variables in relation to the Target variable
round(bd.groupby('Target').mean(),2)
# default
table=pd.crosstab(bd.default, bd.Target)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of default vs Term Deposit');
plt.xlabel('default');
plt.ylabel('Proportion of Customers');
#Job Status
table=pd.crosstab(bd.job, bd.Target)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True);
plt.title('Stacked Bar Chart of job vs Term deposit');
plt.xlabel('Job');
plt.ylabel('Proportion of Customers');
# Marital Status
table=pd.crosstab(bd.marital, bd.Target)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True);
plt.title('Stacked Bar Chart of marital vs Term Deposit');
plt.xlabel('Marital Status');
plt.ylabel('Proportion of Customers');
# Education Status
table=pd.crosstab(bd.education, bd.Target)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of education vs Term Deposit');
plt.xlabel('Education');
plt.ylabel('Proportion of Customers');
sns.boxplot(bd['balance'])
# from the data summary, ProfileReport and the plot above, balance has outliers:
#   mean   1362.27
#   min   -8019.0
#   max  102127.0
from scipy.stats import zscore
balance_outliers = zscore(bd['balance'])
print(balance_outliers)
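`zscore` alone only prints the scores; a common follow-up (not in the original notebook) is to flag points with |z| above a threshold such as 3 as outlier candidates. A sketch on synthetic values:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

s = pd.Series([10] * 20 + [300])   # one obvious outlier among 21 points
z = zscore(s)                      # (x - mean) / std, population std by default
outliers = s[np.abs(z) > 3]        # boolean mask keeps only extreme points
print(outliers)
```

Applied to `bd['balance']`, the same mask would list the customers whose balances sit more than three standard deviations from the mean, ready for capping or removal.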
sns.boxplot(x=bd['Target'], y=bd['balance'])
sns.boxplot(x=bd['Target'], y=bd['age'])
for i in ['age','balance']:
    sns.distplot(bd[i])
    plt.show()
corr = bd.corr()
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.figure(figsize=(13,7))
# create a mask so we only see the correlation values once
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask, 1)] = True
a = sns.heatmap(corr,mask=mask, annot=True, fmt='.2f')
rotx = a.set_xticklabels(a.get_xticklabels(), rotation=90)
roty = a.set_yticklabels(a.get_yticklabels(), rotation=30)
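To see what the mask does on a small case: `np.triu_indices_from(mask, 1)` selects the strict upper triangle (offset 1 excludes the diagonal), so each pairwise correlation appears only once in the heatmap:

```python
import numpy as np

corr = np.array([[1.0, 0.5],
                 [0.5, 1.0]])
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask, 1)] = True   # hide the strict upper triangle
print(mask)
```

Seaborn's `heatmap` skips every cell where the mask is True, leaving the lower triangle and the diagonal visible.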
The Target variable has a relatively strong correlation with the duration variable, at 0.39.
sns.pairplot(bd[['age','balance','duration','campaign']]);
# changing object columns to the categorical data type
col_obj = bd.select_dtypes(include='object')
for nun in col_obj:
    bd[nun] = bd[nun].astype('category')
data = pd.get_dummies(bd, columns=['job','marital','education','default','housing','loan','contact','day','month','poutcome'])
data.columns
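On a toy frame, `get_dummies` expands one categorical column into one indicator column per level, which is exactly what happens to each of the columns listed above:

```python
import pandas as pd

toy = pd.DataFrame({'marital': ['single', 'married', 'single']})
encoded = pd.get_dummies(toy, columns=['marital'])
print(encoded)
```

Each new column is named `<original>_<level>`, so the wide `data.columns` listing above is simply every level of every encoded column laid out side by side.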
X = data.drop(columns=['Target'])
X.shape
y = data['Target']
y.shape
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# checking splitting data 70:30
print("{0:0.2f}% data is in training set".format((len(X_train)/len(data.index)) * 100), '- total of', len(X_train))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(data.index)) * 100),'- total of', len(X_test))
print("Total records: ", len(data.index))
print("Training records: ", len(X_train))
print("Testing records: ", len(X_test))
print("{0:0.2f}% data is in training set".format((len(X_train)/len(data.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(data.index)) * 100))
pos = y_train[y_train == 1].count()
neg = y_train[y_train == 0].count()
print("Positive = ", pos)
print("Negative = ", neg)
perc = pos/len(y_train)
print(perc)
pos = y_test[y_test == 1].count()
neg = y_test[y_test == 0].count()
print("Positive = ", pos)
print("Negative = ", neg)
perc = pos/len(y_test)
print(perc)
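Given the class imbalance just measured, passing `stratify=y` to `train_test_split` is worth considering so that both splits keep the same positive rate (the split above does not use it). A sketch on toy labels with the same kind of skew:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 16 + [1] * 4)      # 80/20 imbalance, like the Target skew
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)
```

Without `stratify`, a random split can leave the small positive class over- or under-represented in the test set, which distorts recall and precision estimates.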
sc_tra,sc_tes,TPf,TNf,FPf,FNf = 0,0,0,0,0,0
Rec,Spec,Pres,Accu,f1,logit_roc_auc = 0,0,0,0,0,0

def score_confmetrics(model, X_train, y_train, X_test, y_test):
    global sc_tra,sc_tes,TPf,TNf,FPf,FNf,Rec,Spec,Pres,Accu,f1,logit_roc_auc
    ml_train = model.fit(X_train, y_train)
    sc_tra = round(ml_train.score(X_train, y_train),4)
    sc_tes = round(ml_train.score(X_test, y_test),4)
    # prediction
    y_predict = model.predict(X_test)
    # score the model
    print('\n' * 1)
    print('Model score_train :', sc_tra)
    print('Model score_test :', sc_tes)
    # confusion matrix (rows = actual, columns = predicted)
    confusion = confusion_matrix(y_test, y_predict)
    sns.heatmap(confusion, annot=True, fmt='.2f', xticklabels=["No", "Yes"], yticklabels=["No", "Yes"])
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    plt.show()
    # metric data
    print('\n' * 1)
    TPf = confusion[1,1]
    TNf = confusion[0,0]
    FPf = confusion[0,1]   # actual "No" predicted "Yes"
    FNf = confusion[1,0]   # actual "Yes" predicted "No"
    # classification report
    Rec  = round(TPf/float(TPf+FNf),4)
    Spec = round(TNf/float(TNf+FPf),4)
    Pres = round(TPf/float(TPf+FPf),4)
    Accu = round((TPf+TNf)/float(TPf+TNf+FPf+FNf),4)
    f1   = round(2*Pres*Rec/(Pres+Rec),4)
    print(classification_report(y_test, y_predict))
    # ROC curve
    logit_roc_auc = round(roc_auc_score(y_test, y_predict),4)
    fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
    plt.figure()
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % logit_roc_auc)
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate (1-Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.savefig('Log_ROC')
    plt.show()
    # output
    return (sc_tra,sc_tes,TPf,TNf,FPf,FNf,Rec,Spec,Pres,Accu,f1,logit_roc_auc)
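A quick sanity check on the confusion-matrix indexing used inside `score_confmetrics`: sklearn's `confusion_matrix` lays out rows as actual and columns as predicted, so `ravel()` yields TN, FP, FN, TP in that order, and the manual ratio formulas agree with sklearn's scorers:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0])
TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()

# the manual formulas used in score_confmetrics
recall = TP / (TP + FN)
precision = TP / (TP + FP)
```

Getting this layout wrong silently swaps false positives and false negatives, which flips recall and specificity, so the check is worth the four lines.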
Logistic Regression
#Logistic Regression - Hyper Parameters Optimization
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
clas_LR = LogisticRegression()
params_LR={
'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
'penalty' :['l1', 'l2', 'elasticnet', 'none'],
'max_iter' :[100, 1000, 2000, 3000, 4000]
}
random_searchLR=RandomizedSearchCV(clas_LR,param_distributions=params_LR,
n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3);
random_searchLR.fit(X_train, y_train);
random_searchLR.best_params_
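`RandomizedSearchCV` samples only `n_iter` candidates from the grid instead of trying all of them, which is why it is cheap here. A self-contained sketch on synthetic data (the `C` grid below is illustrative, not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_classification(n_samples=200, random_state=0)
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={'C': [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    n_iter=3, cv=3, scoring='roc_auc', random_state=0,
)
search.fit(X_toy, y_toy)
```

`best_params_` and `best_score_` then report the best sampled combination and its cross-validated AUC, exactly as used for each model below.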
# Fit the model on train
model_LR = LogisticRegression(solver= 'liblinear', penalty= 'l1', max_iter= 4000, random_state=1)
# calling the score_confmetrics function
model = model_LR
score_confmetrics(model, X_train, y_train, X_test, y_test)
# Transferring data
str_LR, ste_LR, TP_LR, TN_LR,FP_LR, FN_LR = sc_tra,sc_tes,TPf,TNf,FPf,FNf
R_LR,S_LR,P_LR,A_LR,f1_LR,l_LR = Rec,Spec,Pres,Accu,f1,logit_roc_auc
Building Decision Tree Model
# decision tree model
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
# calling the score_confmetrics function
model = dTree
score_confmetrics(model, X_train, y_train, X_test, y_test)
# Transferring data
str_DT, ste_DT, TP_DT, TN_DT,FP_DT, FN_DT = sc_tra,sc_tes,TPf,TNf,FPf,FNf
R_DT,S_DT,P_DT,A_DT,f1_DT,l_DT = Rec,Spec,Pres,Accu,f1,logit_roc_auc
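The gap between train and test scores above comes from an unconstrained tree memorizing the training set; capping `max_depth` (as the next search does) trades that memorization for generalization. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)          # no depth cap
shallow = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
```

The unconstrained tree scores 1.0 on its own training data, which is the signature of overfitting that the hyperparameter search below is meant to remove.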
Decision Tree Classifier - Hyper Parameters Optimization
#Reduce Overfitting - Decision Tree Classifier - Hyper Parameters Optimization
clas_DTr = DecisionTreeClassifier()
#
params_DTr={
'criterion' :['gini','entropy'],
'max_depth' :[ 1, 3, 5, 7, 10, 13, 16],
'min_samples_split':[ 2, 3, 5, 7, 10],
'min_samples_leaf' :[ 1, 2, 3, 5, 7, 10],
}
#
random_searchDT=RandomizedSearchCV(clas_DTr,param_distributions=params_DTr,
n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3);
random_searchDT.fit(X_train, y_train);
random_searchDT.best_params_
# pruned decision tree model
dTree_r = DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=10, min_samples_split=10, random_state=1)
# calling the score_confmetrics function
model = dTree_r
score_confmetrics(model, X_train, y_train, X_test, y_test)
# Transferring data
str_DTr, ste_DTr, TP_DTr, TN_DTr,FP_DTr, FN_DTr = sc_tra,sc_tes,TPf,TNf,FPf,FNf
R_DTr,S_DTr,P_DTr,A_DTr,f1_DTr,l_DTr = Rec,Spec,Pres,Accu,f1,logit_roc_auc
Visualization - Reducing Overfitting
#Visualization - Reducing Overfitting
from sklearn.tree import export_graphviz
import io
from io import StringIO
from IPython.display import Image
import pydotplus
import graphviz
#
xvar = data.drop(columns=['Target'])
feature_cols = xvar.columns
train_char_label = ['No', 'Yes']
bank_tar = StringIO()
export_graphviz(dTree_r, out_file=bank_tar,
filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=train_char_label)
graph_bt = pydotplus.graph_from_dot_data(bank_tar.getvalue())
graph_bt.write_png('bank_target.png')
Image(graph_bt.create_png())
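When Graphviz and pydotplus are not available, sklearn's own `export_text` gives a plain-text view of the same kind of pruned tree; a sketch on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The same call with `dTree_r` and `feature_cols` would print the bank-data tree without any external rendering dependency.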
Ensemble learning - Bagging
#Bagging Classifier - Hyper Parameters Optimization
from sklearn.ensemble import BaggingClassifier
clas_B = BaggingClassifier()
#
params_B={
'n_estimators' :[10, 25, 50, 75, 100],
'max_samples' :[0.01, 0.1, 0.5, 0.75, 1.0],
'max_features' :[0.01, 0.1, 0.5, 0.75, 1.0]
}
#
random_searchB=RandomizedSearchCV(clas_B,param_distributions=params_B,
n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3);
random_searchB.fit(X_train, y_train);
random_searchB.best_params_
bgcl = BaggingClassifier(n_estimators=50, max_samples= 0.01,max_features= 0.75, random_state=1)
# calling the score_confmetrics function
model = bgcl
score_confmetrics(model, X_train, y_train, X_test, y_test)
# Transferring data
str_bg, ste_bg, TP_bg, TN_bg,FP_bg, FN_bg = sc_tra,sc_tes,TPf,TNf,FPf,FNf
R_bg,S_bg,P_bg,A_bg,f1_bg,l_bg = Rec,Spec,Pres,Accu,f1,logit_roc_auc
Random Forest Classifier
#Random Forest Classifier - Hyper Parameters Optimization
from sklearn.ensemble import RandomForestClassifier
clas_RF = RandomForestClassifier()
#
params_RF={
'criterion' :['gini','entropy'],
'n_estimators' :[100, 250, 500, 750, 1000],
'max_depth' :[ 1, 3, 5, 7, 10, 13, 16],
'min_samples_split' :[2, 3, 5, 7, 10, 15],
'min_samples_leaf' :[1, 2, 3, 5, 7, 10, 15]
}
#
random_searchRF=RandomizedSearchCV(clas_RF,param_distributions=params_RF,
n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3);
random_searchRF.fit(X_train, y_train);
random_searchRF.best_params_
rfcl = RandomForestClassifier(criterion = 'entropy', n_estimators = 1000, max_depth=15,
min_samples_split=15, min_samples_leaf=2,random_state=1)
# calling the score_confmetrics function
model = rfcl
score_confmetrics(model, X_train, y_train, X_test, y_test)
# Transferring data
str_rf, ste_rf, TP_rf, TN_rf,FP_rf, FN_rf = sc_tra,sc_tes,TPf,TNf,FPf,FNf
R_rf,S_rf,P_rf,A_rf,f1_rf,l_rf = Rec,Spec,Pres,Accu,f1,logit_roc_auc
AdaBoosting
#AdaBoosting Classifier - Hyper Parameters Optimization
from sklearn.ensemble import AdaBoostClassifier
clas_AB = AdaBoostClassifier()
#
params_AB={
'n_estimators' :[50, 100, 150, 200, 250],
'learning_rate' :[0.1, 0.3, 0.5, 0.75, 1.0],
}
#
random_searchAB=RandomizedSearchCV(clas_AB,param_distributions=params_AB,
n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3);
random_searchAB.fit(X_train, y_train);
random_searchAB.best_params_
abcl = AdaBoostClassifier(n_estimators=150, learning_rate=0.3, random_state=1)
# calling the score_confmetrics function
model = abcl
score_confmetrics(model, X_train, y_train, X_test, y_test)
# Transferring data
str_ab, ste_ab, TP_ab, TN_ab,FP_ab, FN_ab = sc_tra,sc_tes,TPf,TNf,FPf,FNf
R_ab,S_ab,P_ab,A_ab,f1_ab,l_ab = Rec,Spec,Pres,Accu,f1,logit_roc_auc
GradientBoost
#GradientBoost Classifier - Hyper Parameters Optimization
from sklearn.ensemble import GradientBoostingClassifier
clas_GB = GradientBoostingClassifier()
#
params_GB={
'n_estimators' :[100, 150, 200, 250],
'learning_rate' :[0.01, 0.05, 0.075, 0.1],
'max_depth' :[1,2,3],
'min_samples_split':[2,3,4,5],
'min_samples_leaf' :[2, 3, 5, 7, 10],
}
#
random_searchGB=RandomizedSearchCV(clas_GB,param_distributions=params_GB,
n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3);
random_searchGB.fit(X_train, y_train);
random_searchGB.best_params_
gbcl = GradientBoostingClassifier(n_estimators = 200, max_depth=1,
learning_rate=0.1, min_samples_leaf=2,
min_samples_split=2, random_state=1)
# calling the score_confmetrics function
model = gbcl
score_confmetrics(model, X_train, y_train, X_test, y_test)
# Transferring data
str_gb, ste_gb, TP_gb, TN_gb,FP_gb, FN_gb = sc_tra,sc_tes,TPf,TNf,FPf,FNf
R_gb,S_gb,P_gb,A_gb,f1_gb,l_gb = Rec,Spec,Pres,Accu,f1,logit_roc_auc
XGBoost Classifier
#XGBoost Classifier - Hyper Parameters Optimization
import xgboost as xgb
classifier = xgb.XGBClassifier()
#
params={
"learning_rate" :[0.001, 0.05, 0.1, 0.5, 0.75],
"max_depth" :[ 1, 2, 3],
"min_child_weight" :[ 1, 3, 5, 7],
"gamma" :[0.0, 0.1, 0.2, 0.3, 0.4],
"colsample_bytree" :[0.3, 0.4, 0.5, 0.7]
}
#
random_search=RandomizedSearchCV(classifier,param_distributions=params,
n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)
random_search.fit(X_train, y_train);
random_search.best_params_
xgb_m = xgb.XGBClassifier(min_child_weight= 5,
max_depth= 2,
learning_rate= 0.75,
gamma= 0.1,
colsample_bytree= 0.4, random_state=1)
#
# calling the score_confmetrics function
model = xgb_m
score_confmetrics(model, X_train, y_train, X_test, y_test)
# Transferring data
str_xgb, ste_xgb, TP_xgb, TN_xgb,FP_xgb, FN_xgb = sc_tra,sc_tes,TPf,TNf,FPf,FNf
R_xgb,S_xgb,P_xgb,A_xgb,f1_xgb,l_xgb = Rec,Spec,Pres,Accu,f1,logit_roc_auc
LightGBM
#LightGBM Classifier - Hyper Parameters Optimization
import lightgbm as lgbm
clas_LGB = lgbm.LGBMClassifier()
#
params_LGB={
'n_estimators' :[100, 150, 200, 250],
'learning_rate' :[0.02, 0.04, 0.06, 0.08, 0.1]
}
#
random_searchLGB=RandomizedSearchCV(clas_LGB,param_distributions=params_LGB,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3);
random_searchLGB.fit(X_train, y_train);
random_searchLGB.best_params_
lgbm_m = lgbm.LGBMClassifier(n_estimators = 150, learning_rate=0.04, random_state=1)
# calling the score_confmetrics function
model = lgbm_m
score_confmetrics(model, X_train, y_train, X_test, y_test)
# Transferring data
str_lgb, ste_lgb, TP_lgb, TN_lgb,FP_lgb, FN_lgb = sc_tra,sc_tes,TPf,TNf,FPf,FNf
R_lgb,S_lgb,P_lgb,A_lgb,f1_lgb,l_lgb = Rec,Spec,Pres,Accu,f1,logit_roc_auc
Summary of the Algorithms
#Store the accuracy results for each model in a dataframe for final comparison
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.DataFrame({'Method':['Logistic Regression','Decision Tree- Reduce Overfit',
                                    'Bagging','RandomForest - Reduce Overfit',
                                    'AdaBoosting','GradientBoost','XGBoost','LightGBM'],
                          'Score_Train': [str_LR,str_DTr,str_bg,str_rf,str_ab,str_gb,str_xgb,str_lgb],
                          'Score_Test': [ste_LR,ste_DTr,ste_bg,ste_rf,ste_ab,ste_gb,ste_xgb,ste_lgb],
                          'True Positive': [ TP_LR, TP_DTr, TP_bg, TP_rf, TP_ab, TP_gb, TP_xgb, TP_lgb],
                          'True Negative': [ TN_LR, TN_DTr, TN_bg, TN_rf, TN_ab, TN_gb, TN_xgb, TN_lgb],
                          'False Positive':[ FP_LR, FP_DTr, FP_bg, FP_rf, FP_ab, FP_gb, FP_xgb, FP_lgb],
                          'False Negative':[ FN_LR, FN_DTr, FN_bg, FN_rf, FN_ab, FN_gb, FN_xgb, FN_lgb],
                          'Recall': [ R_LR, R_DTr, R_bg, R_rf, R_ab, R_gb, R_xgb, R_lgb],
                          'Specificity': [ S_LR, S_DTr, S_bg, S_rf, S_ab, S_gb, S_xgb, S_lgb],
                          'Precision': [ P_LR, P_DTr, P_bg, P_rf, P_ab, P_gb, P_xgb, P_lgb],
                          'F1': [ f1_LR, f1_DTr, f1_bg, f1_rf, f1_ab, f1_gb, f1_xgb, f1_lgb],
                          'Area U_Curve': [ l_LR, l_DTr, l_bg, l_rf, l_ab, l_gb, l_xgb, l_lgb]})
resultsDf = resultsDf[['Method', 'Score_Train','Score_Test','True Positive','True Negative','False Positive',
                       'False Negative','Recall','Specificity','Precision','F1','Area U_Curve']]
resultsDf = resultsDf.sort_values(by=['Area U_Curve']).set_index('Method')
resultsDf
ax=resultsDf[['Score_Train','Score_Test']]
ax.plot(figsize=(15,6), rot=90, mark_right=True, linestyle='--', marker='o');
ax =resultsDf[['Recall','Precision','F1', 'Area U_Curve']]
ax.plot(figsize=(15,6), rot=90, mark_right=True, linestyle='--', marker='o');
ax = resultsDf[['True Positive','False Positive','False Negative']]
ax.plot(figsize=(15,5), rot=90, mark_right=True, linestyle='--', marker='o');
ax = resultsDf[['True Negative']]
ax.plot(figsize=(15,5), rot=90, mark_right=True, linestyle='--', marker='o', color='r');
resultsDf.describe()
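Ranking the summary table and picking the winner is a one-liner; the hypothetical AUC values below stand in for the real resultsDf numbers:

```python
import pandas as pd

summary = pd.DataFrame(
    {'Area U_Curve': [0.71, 0.85, 0.78]},          # hypothetical AUC values
    index=['Bagging', 'LightGBM', 'Logistic Regression'],
)
ranked = summary.sort_values('Area U_Curve', ascending=False)
best_method = ranked.index[0]
```

The same `sort_values(..., ascending=False).index[0]` applied to `resultsDf` returns the best-performing method programmatically rather than by eyeballing the table.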
Final Observations:
1. The two lowest-performing models: Bagging and Random Forest.
2. The two best-performing models: XGBoost and LightGBM (best algorithm overall).
3. Ranked by area under the ROC curve, the algorithms perform in the order shown in the summary table above.